A sample workflow

Author

Lizzie, Victor

Published

April 2025

To show our workflow in action, we examine a subset of the Living Planet Index (LPI, download at: https://www.livingplanetindex.org/data_portal. This download should include a csv file, here: LPD2022_public.csv, that we use below). For more information on the LPI, check out the main text and www.livingplanetindex.org/.

Note: data from Jonathan comes from LivingPlanetIndexDatabase_2025-02-01_19.19.51.

1 Step 1: Research question

I started this, but it would be perfect for JD to fix/add to, so I did not finish it. (As mentioned above, I suggest you focus on getting some code/plots and basic text done that you then share next week with everyone.)

Here we need to focus on exactly what question we are trying to answer, which will guide us in the next steps. This step is sometimes skipped over or moved through so quickly that the exact question is not clear, but is critical for developing a useful model. We might start with: What is the global trend for biodiversity in vetebrate species? This is rather broad and might mean we should develop a model that will yield one global estimate. In thinking about this, we might realize we don’t want just that, we actually want to understand the variation across different species and perhaps populations and also a global estimate. So we might refine the question to:

What is global trend in vertebrate species population numbers, both overall and how does it vary acrosss species and populations?

wd <- "/home/victor/projects/forecastflows/analyses"

set.seed(20012029)

library(ggplot2)

lpi_subset <- read.table(file.path(wd, "jdavies/African_mammals.txt"))

lpi_subset <- lpi_subset[, c("ID", "Genus", "Species", "Units", paste0("X", 1950:2020))]
lpi_subset$genus_species <- paste(lpi_subset$Genus, lpi_subset$Species)

lpi_subset_rsh <- 
  reshape(lpi_subset, direction = "long", idvar = "ID",
          varying =  paste0("X", 1950:2020), v.names = c("value"), timevar = "year", times = 1950:2020)
lpi_subset_rsh <- lpi_subset_rsh[lpi_subset_rsh$value != "NULL",]

2 Step 2: Model building

Next we need to think about how we would build a model to address this question, with the goal to build the model and evaluate it using simulated data. There is a lot of demographic models in population ecology, so we might revisit those and consider a discrete time model based on population size and stepping through each year of data. We might also start with a simple linear model, where populations can go up, down or do nothing. That gets us started, but now we need to consider how we will model species and populations. Populations usually are nested in species in ecological models, so we might consider that to start. Our resulting model might then be:

\log(y_i) \sim \text{normal}(\mu_i, \sigma)

\mu_i = \alpha_{\text{sp}[i],\text{pop}[i]} + \beta_{\text{sp}[i],\text{pop}[i]} * \text{year}_i

\alpha_{\text{sp}[i],\text{pop}[i]} \sim \text{normal}(\mu_{\text{pop}[i]}^\alpha, \sigma_{\text{pop}[i]}^\alpha) \\ \mu_{\text{pop}[i]}^\alpha \sim \text{normal}(\mu_{\text{sp}[i]}^\alpha, \sigma_{\text{sp}[i]}^\alpha)

\beta{\text{sp}[i],\text{pop}[i]} \sim \text{normal}(\mu_{\text{pop}[i]}^\beta, \sigma_{\text{pop}[i]}^\beta) \\ \mu_{\text{pop}[i]}^\beta \sim \text{normal}(\mu_{\text{sp}[i]}^\beta, \sigma_{\text{sp}[i]}^\beta)

I’m not 100% sure with the notation here.

3 Step 3: Model evaluation

Now we need to simulate data to test our model. As we start to think about how to do this, we may become overwhelmed with the scale of the question we started with. We likely would not feel immediately sure how to simulate for all different vertebrate species.

3.1 Stepping backward…

Here we would likely feedback one to two steps. We could step back to Step 2 and consider adding a phylogeny or other structure from which we could then simulate data from, or we could step back to Step 1 and narrow the question (before potentially building it back up.) We believe the latter approach is often better as working on a small scale first can point out more fundamental problems versus starting big and then spending much longer to identify fundamental problems.

So here we refine our question to:

What is trend in mammal species population numbers in Africa, both overall and how does it vary across species and populations?

3.2 Simulating data

Now we start to simulate!

nspecies <- 9 # no. of species
npops_persp <- round(runif(nspecies, 2, 15)) # no. of different populations per species
npops <- sum(npops_persp) # total no. of different populations
nobs_perpop <- round(runif(npops, 10, 40)) # no. of different observations per population

spid_perpop <- rep(1:nspecies, times = npops_persp)
spid_perobs <- rep(spid_perpop, times = nobs_perpop)
popid_perobs <- rep(1:npops, times = nobs_perpop)

years <- c()
for(np in nobs_perpop){
  years_p <- sample(1950:2020, size = np, replace = FALSE)
  years <- c(years, years_p)
}

# nested intercepts
mu_species_alpha <- rnorm(n = nspecies, mean = log(200), sd = log(5)) # simulate some species intercepts
sig_species_alpha <- abs(rnorm(n = nspecies, mean = 0, sd = 2))
mu_pop_alpha <- rnorm(n = npops, mean = mu_species_alpha[spid_perpop], sd = sig_species_alpha[spid_perpop]) # population intercepts, nested in species
sig_pop_alpha <- abs(rnorm(n = npops))
alpha <- rnorm(n = npops, mean = mu_pop_alpha, sd = sig_pop_alpha)

# nested slopes
mu_species_beta <- rnorm(n = nspecies, mean = 0, sd = 0.01) # simulate some species slopes
sig_species_beta <- abs(rnorm(n = nspecies, mean = 0, sd = 0.05))
mu_pop_beta <- rnorm(n = npops, mean = mu_species_beta[spid_perpop], sd = sig_species_beta[spid_perpop]) # population slopes, nested in species
sig_pop_beta <- abs(rnorm(n = npops, mean = 0, sd = 0.01))
beta <- rnorm(n = npops, mean = mu_pop_beta, sd = sig_pop_beta)

# save simulated intercepts and slopes
pops_fit <- data.frame()
pid <- 1
for(sp in 1:nspecies){
  for(p in 1:npops_persp[sp]){
    pops_fit <- rbind(pops_fit, data.frame(species = sp, pop = pid))
    pid <- pid + 1
  }
}
pops_fit$int <- mu_pop_alpha
pops_fit$slp <- mu_pop_beta

sigma_obs <- abs(rnorm(1, mean = 0.25, sd = 0.25)) # common observational error

y <- c()
for(p in 1:npops){
  nobs_p <- nobs_perpop[p]
  alpha_p <- alpha[p]
  beta_p <- beta[p]
  
  years_p <- years[popid_perobs == p] 
  
  yhat_p <- alpha_p + beta_p * (years_p - 1980)
  # yhat_p <- alpha_p
  y_p <- exp(rnorm(n = nobs_p, mean = yhat_p, sd = sigma_obs))
  y <- c(y, y_p)
}


datasim <- data.frame(
  y, 
  year = years - 1980,
  species = as.character(spid_perobs),
  population = paste0(spid_perobs,popid_perobs)
)

datasim$logy <- log(datasim$y)

ggplot(data = datasim) +
  geom_line(aes(x = year+1980, y = y, 
                group = population,
                color = species)) +
  theme_bw() +
  labs(x = 'Year', y = 'Simulated counts', colour = 'Species')

ggplot(data = datasim) +
  geom_line(aes(x = year+1980, y = logy, 
                group = population,
                color = species)) +
  theme_bw() +
  labs(x = 'Year', y = 'Simulated counts (log-scale)', colour = 'Species')

3.3 Fit model on simulated data!

nestlm <- lm(logy ~ species/population + year * species/population, data = datasim)
coefs <- summary(nestlm)$coefficients

3.3.1 Retrodictive check with model fitted on simulated data

retrodata <- data.frame(obs = datasim$logy, pred = predict(nestlm),
                        year = datasim$year,
                        species = datasim$species, id = datasim$population)

ggplot(data = retrodata) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey") +
  facet_wrap(~ paste0("Species ", species), scales = "free") +
  geom_point(aes(x = obs, y = pred, group = id, color = as.character(id)),
             size = 0.5) + 
  theme_bw() + theme(legend.position = 'none') +
  labs(x = 'Simulated counts (log-scale)', y = 'Estimated counts (log-scale)')

ggplot(data = retrodata) +
  facet_wrap(~ paste0("Species ", species), scales = "free") +
  geom_point(aes(x = year, y = obs, group = id, color = as.character(id)),
             size = 1) + 
  geom_line(aes(x = year, y = pred, group = id, color = as.character(id))) +
  theme_bw() + theme(legend.position = 'none') +
  labs(x = 'Year', y = 'Counts (log-scale)')

ggplot(data = retrodata) +
  facet_wrap(~ paste0("Species ", species), scales = "free") +
  geom_point(aes(x = year, y = exp(obs), group = id, color = as.character(id)),
             size = 1) + 
  geom_line(aes(x = year, y = exp(pred), group = id, color = as.character(id))) +
  theme_bw() + theme(legend.position = 'none') +
   labs(x = 'Year', y = 'Counts')

3.3.2 Inference with model fitted on simulated data

The model does a fair job at recovering population-level parameters.

em <- data.frame(emmeans::emmeans(nestlm, ~ species/population))
NOTE: A nesting structure was detected in the fitted model:
    population %in% species
NOTE: Results may be misleading due to involvement in interactions
pops_fit$int_fit <- em$emmean
pops_fit$int_fit.lcl <- em$lower.CL
pops_fit$int_fit.ucl <- em$upper.CL

em <- data.frame(emmeans::emtrends(nestlm,  ~ species/population, var = 'year'))
NOTE: A nesting structure was detected in the fitted model:
    population %in% species
pops_fit$slp_fit <- em$year.trend
pops_fit$slp_fit.lcl <- em$lower.CL
pops_fit$slp_fit.ucl <- em$upper.CL

ggplot(data = pops_fit) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey") + 
  geom_pointrange(aes(ymin = int_fit.lcl, ymax = int_fit.ucl, x = int, y = int_fit, color = as.character(species))) +
  theme_bw() +
  labs(x = 'Simulated intercepts', y = 'Estimated intercepts', colour = 'Species')

ggplot(data = pops_fit) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey") + 
  geom_pointrange(aes(ymin = slp_fit.lcl, ymax = slp_fit.ucl, x = slp, y = slp_fit, color = as.character(species))) +
  theme_bw() +
  labs(x = 'Simulated slopes', y = 'Estimated slopes', colour = 'Species')

4 Step 4: Fit model to empirical data

lpi_subset <- read.table("/home/victor/projects/forecastflows/analyses/jdavies/African_mammals.txt")

lpi_subset <- lpi_subset[, c("ID", "Genus", "Species", "Country", "Units", paste0("X", 1950:2020))]
lpi_subset$genus_species <- paste(lpi_subset$Genus, lpi_subset$Species)

lpi_subset_rsh <- 
  reshape(lpi_subset, direction = "long", idvar = "ID",
          varying =  paste0("X", 1950:2020), v.names = c("value"), timevar = "year", times = 1950:2020)
lpi_subset_rsh <- lpi_subset_rsh[lpi_subset_rsh$value != "NULL",]
lpi_subset_rsh$value <- as.numeric(lpi_subset_rsh$value)

The values in the database are not necessarily in the same unit, so I imagine we need to take a decision here..?

units_to_keep <- c("Estimate - entire population", "Estimate - entire population (August)",
                   "Individual", "individuals", "Individuals", "Number of adults", "number of individuals",
                   "Number of individuals", "Number of Individuals", "Number of rhinos", "numbers of indviduals",
                   "population estimate", "Population estimate", "Population estimate (individuals)", 
                   "Population estimates", "Total population size")

lpi_subset_rsh <- lpi_subset_rsh[lpi_subset_rsh$Units %in% units_to_keep,]

ggplot(data = lpi_subset_rsh) +
  geom_line(aes(x = year, y = value, 
                group = paste0(genus_species, ID),
                color = genus_species)) +
  theme_bw() +
  labs(x = 'Year', y = 'Observed counts')

ggplot(data = lpi_subset_rsh[lpi_subset_rsh$value != 0,]) +
  geom_line(aes(x = year, y = log(value), 
                group = paste0(genus_species, ID),
                color = genus_species)) +
  theme_bw() +
  labs(x = 'Year', y = 'Observed counts (log-scale)')

runmodel <- FALSE

lpi_subset_rsh <- lpi_subset_rsh[lpi_subset_rsh$value != 0,] # poor way to avoid log(0)
lpi_subset_rsh$year <- lpi_subset_rsh$year-1980
lpi_subset_rsh$logvalue <- log(lpi_subset_rsh$value)
lpi_subset_rsh$ID <- as.character(lpi_subset_rsh$ID)

if(runmodel){
  nestlm <- lm(logvalue ~ genus_species/ID + year * genus_species/ID, data = lpi_subset_rsh)
  saveRDS(nestlm, file = file.path(wd, "output/fit/nestmodel1.rds"))
}else{
  nestlm <- readRDS(file.path(wd, "output/fit/nestmodel1.rds"))
}

Here we finally get to see how our model performs on the empirical data and are a step closer to answering our question, but we also start to notice some differences in our empirical data from our simulated data. In preparing our data for the model, we realize we have quite different amplitudes of variations across populations and species (on a log-scale).

ggplot(data = lpi_subset_rsh[lpi_subset_rsh$value != 0 & lpi_subset_rsh$genus_species == "Connochaetes taurinus",]) +
  geom_line(aes(x = year, y = log(value), 
                group = paste0(genus_species, ID),
                color = genus_species)) +
  theme_bw() + theme(legend.position = 'none') +
  labs(x = 'Year', y = 'Observed counts (log-scale)')

Here, we might step back and adjust our model (re-doing Step 2-3 again), but in this case we go ahead.

5 Step 5: Retrodictive check

Now we use our fitted model parameters to generate predictions, and compare them to observations.

retrodata <- data.frame(obs = lpi_subset_rsh$logvalue, pred = predict(nestlm),
                        year = lpi_subset_rsh$year, units = lpi_subset_rsh$Units,
                        species = lpi_subset_rsh$genus_species, id = lpi_subset_rsh$ID,
                        country = lpi_subset_rsh$Country)

ggplot(data = retrodata) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey") +
  facet_wrap(~ species, scales = "free") +
  geom_point(aes(x = obs, y = pred, group = id, color = as.character(id)),
             size = 0.5) + 
  theme_bw() + theme(legend.position = 'none') +
  labs(x = 'Observed counts (log-scale)', y = 'Estimated counts (log-scale)')

We might examine species separately, e.g. here the African bush elephant.

ggplot(data = retrodata[retrodata$species == "Loxodonta africana", ]) +
  facet_wrap(~ country, scales = "free_y") +
  geom_point(aes(x = year + 1980, y = obs, group = id, color = id),
             size = 1) + 
  geom_line(aes(x = year + 1980, y = pred, group = id, color = id)) +
  theme_bw() + theme(legend.position = 'none') +
  labs(y = 'Counts (log-scale)', x = 'Year')

ggplot(data = retrodata[retrodata$species == "Loxodonta africana", ]) +
  facet_wrap(~ country, scales = "free_y") +
  geom_point(aes(x = year + 1980, y = exp(obs), group = id, color = id),
             size = 1) + 
  geom_line(aes(x = year + 1980, y = exp(pred), group = id, color = id)) +
  theme_bw() + theme(legend.position = 'none') +
  labs(y = 'Counts', x = 'Year')

We can focus on populations in South Africa.

ggplot(data = retrodata[retrodata$species == "Loxodonta africana" & retrodata$country == "South Africa", ]) +
  facet_wrap( ~ id, scales = "free_y") +
  geom_point(aes(x = year + 1980, y = obs, group = id, color = units),
             size = 0.7) + 
  geom_line(aes(x = year + 1980, y = obs, group = id, color = units),
             linewidth = 0.3) + 
  geom_line(aes(x = year + 1980, y = pred, group = id),
            linewidth = 0.5) +
  theme_bw() + 
  labs(y = 'Counts (log-scale)', x = 'Year', color = 'Units')

ggplot(data = retrodata[retrodata$species == "Loxodonta africana" & retrodata$country == "South Africa", ]) +
  facet_wrap( ~ id, scales = "free_y") +
  geom_point(aes(x = year + 1980, y = exp(obs), group = id, color = units),
             size = 0.7) + 
  geom_line(aes(x = year + 1980, y = exp(obs), group = id, color = units),
             linewidth = 0.3) + 
  geom_line(aes(x = year + 1980, y = exp(pred), group = id),
            linewidth = 0.5) +
  theme_bw() + 
  labs(y = 'Counts', x = 'Year', color = 'Units')

The three populations are measured in the same units, but likely represent different spatial scales (ranging from a few hundred individuals to more than 10000).

The same holds for other species and other countries, e.g. the blue wildebeest in Tanzania.

ggplot(data = retrodata[retrodata$species == "Connochaetes taurinus" & retrodata$country == "Tanzania, United Republic Of", ]) +
  facet_wrap( ~ id, scales = "free_y") +
  geom_point(aes(x = year + 1980, y = exp(obs), group = id, color = units),
             size = 0.7) + 
  geom_line(aes(x = year + 1980, y = exp(obs), group = id, color = units),
             linewidth = 0.3) + 
  geom_line(aes(x = year + 1980, y = exp(pred), group = id),
            linewidth = 0.5) +
  theme_bw() + 
  labs(y = 'Counts', x = 'Year', color = 'Units')

The four populations are also measured in the same units (I guess?!), but represents different scales (so the blue one is likely a population, while the others represent region/country level?).

6 Feedbacks

At this point, we would have a lot of ideas of ways to improve this model that we would want to follow up on, including questioning our assumption of one sigma for all populations and potentially our linearity assumption. One possibility would be to move to a non-linear regression, and use a more flexible model Quick example for one of the wildbeest dataset in Tanzania:

gompodel <- nls(logvalue ~ SSgompertz(year, Asym, b2, b3), data=lpi_subset_rsh[lpi_subset_rsh$ID == 5589,])
fit <-data.frame(pred = predict(gompodel, newdata = data.frame(year = seq(-25,30,1))), year = seq(-25,30,1))

ggplot(data = lpi_subset_rsh[lpi_subset_rsh$ID == 5589,]) +
  facet_wrap( ~ ID, scales = "free_y") +
  geom_point(aes(x = year + 1980, y = value),
             size = 0.7, color = "#F8766D") + 
  geom_line(aes(x = year + 1980, y = value),
             linewidth = 0.3, color = "#F8766D") + 
  geom_line(aes(x = year + 1980, y = exp(pred)),
            linewidth = 0.5, data = fit) +
  theme_bw() + 
  labs(y = 'Counts', x = 'Year', color = 'Units')

We also may want to add predictors for the slope such as … We would do this, returning to Steps 2-5 likely several times until we have a model that is good enough for our purposes here.

And then we would likely add in more countries and species….

I might also add a short paragraph about moving away from linear regression and towards models for species abudance (via mixtures I think? I suspect the articles I posted for class have some good ideas) and thinking more about linking to demography models.

And then we would realize we really don’t have great data for this question and we would …